## **Cache Memories**

Professor Hugh C. Lauer CS-2011, Machine Organization and Assembly Language

(Slides include copyright materials from *Computer Systems: A Programmer's Perspective*, by Bryant and O'Hallaron, and from *The C Programming Language*, by Kernighan and Ritchie)

# **Today**

- Cache memory organization and operation
- **■** Performance impact of caches

Reading Assignment: §6.1 – §6.5

- The memory mountain
- Rearranging loops to improve spatial locality
- Using blocking to improve temporal locality

## **Cache Memories in Processors**

- Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  - Hold frequently accessed blocks of main memory
- Processor looks first for data in caches (e.g., L1, L2, and L3), then in main memory.
- Typical system structure:



## **Caching issues**

- How do you know whether an item is in the cache or not?
  - I.e., how do you find it quickly?
- What do you do when the item you want is not in the cache
  - And how do you make space for it?

# Cache Organization (S, E, B)





# **Example: Direct Mapped Cache (E = 1)**

Direct mapped: One line per set Assume: cache block size 8 bytes



# **Example: Direct Mapped Cache (E = 1)**

Direct mapped: One line per set Assume: cache block size 8 bytes



# **Example: Direct Mapped Cache (E = 1)**

Direct mapped: One line per set Assume: cache block size 8 bytes



No match: old line is evicted and replaced

# **Direct-Mapped Cache Simulation**

M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 Blocks/set

Address trace (reads, one byte per read):

| 0 | $[0000_{2}],$ | miss |
|---|---------------|------|
| 1 | $[0001_2],$   | hit  |
| 7 | $[0111_2],$   | miss |
| 8 | $[1000_{2}],$ | miss |
| 0 | [00002]       | miss |

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 0   | M[0-1] |
| Set 1 |   |     |        |
| Set 2 |   |     |        |
| Set 3 | 1 | 0   | M[6-7] |

## A Higher Level Example

```
int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}</pre>
```

```
int sum_array_cols(double a[16][16])
{
   int i, j;
   double sum = 0;

   for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
        sum += a[i][j];
   return sum;
}</pre>
```

Ignore the variables sum, i, j

assume: cold (empty) cache, a[0][0] goes here 32 bytes = 4 doubles

blackboard

# **Questions?**

# E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set



# E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set



## E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set

Assume: cache block size 8 bytes

Address of short int:

t bits

0...01 100

valid? + match: yes = hit

v tag

0 1 2 3 4 5 6 7

v tag

0 1 2 3 4 5 6 7

block offset

short int (2 Bytes) is here

#### No match:

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...

## 2-Way Set Associative Cache Simulation

M=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 blocks/set

Address trace (reads, one byte per read):

| 0 | $[00\underline{0}0_{2}],$ | miss |
|---|---------------------------|------|
| 1 | $[0001_{2}],$             | hit  |
| 7 | $[01\underline{1}1_{2}],$ | miss |
| 8 | $[10\underline{0}0_{2}],$ | miss |
| 0 | $[00\underline{0}0_{2}]$  | hit  |

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 00  | M[0-1] |
|       | 1 | 10  | M[8-9] |
|       |   |     |        |
| Set 1 | 1 | 01  | M[6-7] |
| 3611  | 0 |     |        |
|       |   |     |        |



CS-2011, B-Term 2017 Cache Memories

## A Higher Level Example

```
int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}</pre>
```

```
int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}</pre>
```

#### Ignore the variables sum, i, j



blackboard

## What about writes?

## Multiple copies of data exist:

L1, L2, Main Memory, Disk

#### What to do on a write-hit?

- Write-through (write immediately to memory)
- Write-back (defer write to memory until replacement of line)
  - Need a dirty bit (meaning: line different from memory or not)

### What to do on a write-miss?

- Write-allocate (load into cache, update line in cache)
  - Good if more writes to the location follow
- No-write-allocate (writes immediately to memory)

## Typical

- Write-through + No-write-allocate
- Write-back + Write-allocate

# **Intel Core i7 Cache Hierarchy**

#### **Processor package**



#### L1 i-cache and d-cache:

32 KB each, 8-way, Access: 4 cycles

#### L2 unified\* cache:

256 KB, 8-way, Access: 11 cycles

#### L3 unified cache:

8 MB, 16-way, Access: 30-40 cycles

**Block size**: 64 bytes for all caches.

\* "Unified" = instructions + data



# **Questions?**

# **Today**

- Cache organization and operation
- Performance impact of caches

See: §6.6

- The memory mountain
- Rearranging loops to improve spatial locality
- Using blocking to improve temporal locality

## **The Memory Mountain**

- Read throughput (read bandwidth)
  - Number of bytes read from memory per second (MB/s)

- Memory mountain: Measured read throughput as a function of spatial and temporal locality.
  - Compact way to characterize memory system performance.

## **Memory Mountain Test Function**

```
long data[MAXELEMS]; /* Global array to traverse */
/* test - Iterate over first "elems" elements of
      array "data" with stride of "stride", using
      using 4x4 loop unrolling.
*/
int test(int elems, int stride) {
  long i, sx2=stride*2, sx3=stride*3, sx4=stride*4;
  long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
  long length = elems, limit = length - sx4;
  /* Combine 4 elements at a time */
  for (i = 0; i < limit; i += sx4) {
     acc0 = acc0 + data[i];
     acc1 = acc1 + data[i+stride];
     acc2 = acc2 + data[i+sx2];
     acc3 = acc3 + data[i+sx3];
  /* Finish any remaining elements */
  for (; i < length; i++) {
     acc0 = acc0 + data[i];
  return ((acc0 + acc1) + (acc2 + acc3));
```

Call test() with many combinations of elems and stride.

For each elems and stride:

- 1. Call test() once to warm up the caches.
- 2. Call test()
  again and measure
  the read
  throughput(MB/s)

Core i7 Haswell

32 KB L1 d-cache

2.1 GHz





# **Today**

- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

## **Matrix Multiplication Example**

## Description:

- Multiply N x N matrices
- O(N³) total operations
- N reads per source element
- N values summed per destination
  - but may be able to hold in register

```
/* ijk */
for (i=0; i<n; i++)
for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;
}
</pre>
```

## Miss Rate Analysis for Matrix Multiply

## Assume:

- Cache line size = 32B (big enough for four 64-bit words)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

## Analysis Method:

Look at access pattern of inner loop



CS-2011, B-Term 2017 Cache Memories

# Layout of C Arrays in Memory (review)

- C arrays allocated in row-major order
  - each row in contiguous memory locations
- Stepping through columns in one row:

```
for (i = 0; i < N; i++)
sum += a[0][i];</pre>
```

- accesses successive elements
- if block size (B) > 4 bytes, exploit spatial locality
  - compulsory miss rate = 4 bytes / B

## Stepping through rows in one column:

```
for (i = 0; i < n; i++)
sum += a[i][0];</pre>
```

- accesses distant elements
- no spatial locality!
  - compulsory miss rate = 1 (i.e. 100%)

# **Matrix Multiplication (ijk)**

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
       sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}</pre>
```

## Inner loop:



## Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.25     | 1.0      | 0.0      |

# **Matrix Multiplication (jik)**

```
/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum
  }
}</pre>
```

## Inner loop:



## Misses per inner loop iteration:

<u>A</u> <u>B</u> <u>C</u> 0.25 1.0 0.0

# Matrix Multiplication (jki)

```
/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
  }
}</pre>
```

# Inner loop: (\*,k) (k,j) A B C T Column Wise Column Wise (\*,j) Column Wise

## Misses per inner loop iteration:

<u>A</u> <u>B</u> <u>C</u> 1.0 0.0 1.0

# Matrix Multiplication (kji)

```
/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
  }
}</pre>
```

# Inner loop: (\*,k) (k,j) A B C T Columnwise (\*,j) Columnwise

## Misses per inner loop iteration:

<u>A</u> <u>B</u> <u>C</u> 1.0 0.0 1.0

# **Matrix Multiplication (kij)**

```
/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
  }
}</pre>
```

## Inner loop:



## Misses per inner loop iteration:

<u>A</u> <u>B</u> <u>C</u> 0.0 0.25 0.25

# Matrix Multiplication (ikj)

```
/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
  }
}</pre>
```

## Inner loop:



## Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.0      | 0.25     | 0.25     |

## **Summary of Matrix Multiplication**

```
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
  for (k=0; k<n; k++)
    sum += a[i][k] * b[k][j];
  c[i][j] = sum;
}
}</pre>
```

```
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
  for (j=0; j<n; j++)
    c[i][j] += r * b[k][j];
}</pre>
```

```
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
}</pre>
```

## ijk (& jik):

- 2 loads, 0 stores
- misses/iter = **1.25**

## kij (& ikj):

- 2 loads, 1 store
- misses/iter = **0.5**

## jki (& kji):

- 2 loads, 1 store
- misses/iter = **2.0**

# **Core i7 Matrix Multiply Performance**



## **Today**

- Cache organization and operation
- **■** Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

## **Example: Matrix Multiplication**

```
c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
   int i, j, k;
   for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
        c[i*n+j] += a[i*n + k]*b[k*n + j];
}</pre>
```



n

## **Cache Miss Analysis**

#### Assume:

- Matrix elements are doubles.
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)</li>

#### First iteration:

- n/8 + n = 9n/8 misses
- Afterwards in cache: (schematic)



## **Cache Miss Analysis**

#### Assume:

- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)</li>

#### Second iteration:

Again: n/8 + n = 9n/8 misses



### Total misses:

 $-9n/8 * n^2 = (9/8) * n^3$ 

## **Blocked Matrix Multiplication**



## **Cache Miss Analysis**

#### Assume:

- Cache block = 8 doubles
- Cache size C << n (much smaller than n)</li>
- Three blocks fit into cache: 3B<sup>2</sup> < C

### First (block) iteration:

- B<sup>2</sup>/8 misses for each block
- 2n/B \* B<sup>2</sup>/8 = nB/4 (omitting matrix c)

Afterwards in cache (schematic)



n/B blocks

Block size B x B

## **Cache Miss Analysis**

#### Assume:

- Cache block = 8 doubles
- Cache size C << n (much smaller than n)</li>
- Three block fit into cache: 3B<sup>2</sup> < C

### Second (block) iteration:

- Same as first iteration
- $-2n/B * B^2/8 = nB/4$



### Total misses:

 $\blacksquare$  nB/4 \* (n/B)<sup>2</sup> = n<sup>3</sup>/(4B)

## **Summary**

- No blocking: (9/8) \* n³
- Blocking: 1/(4B) \* n³

Suggest largest possible block size B, but limit 3B<sup>2</sup>

- Reason for dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    - Input data: 3n², computation 2n³
    - Every array element used O(n) times!
  - But program has to be written properly

## **Cache Summary**

Cache memories can have significant performance impact

### You can write your programs to exploit this!

- Focus on the inner loops, where bulk of computations and memory accesses occur.
- Try to maximize spatial locality by reading data objects with sequentially with stride 1.
- Try to maximize temporal locality by using a data object as often as possible once it's read from memory.

# **Questions?**

### One more note about caches

### Suppose a multi-level cache

### L1

- Access time = 4 cycles
- Miss rate =  $m_1$
- Miss penalty =  $p_1$  = average access time of L2

### **L2**

- Access time = 11 cycles
- Miss rate =  $m_2$
- Miss penalty =  $p_2$  = average access time of L3

**...** 

## One more note about caches (continued)

- Average access time<sub>L1</sub> =
  - 4 + m<sub>1</sub> × Average access time <sub>L2</sub> =
  - 4 + m<sub>1</sub> × (11 + m<sub>2</sub> × Average access time <sub>L3</sub>)

### Suppose

- Average access time<sub>L3</sub> = 100 cycles
- $m_1 = 10\%, m_2 = 4\%$
- Then Average access time<sub>11</sub> =

### **Multi-level memories**

- Same analysis can be applied to multiple levels
- Three levels of on-chip cache (Core i7)
- DRAM
  - Possibly multi-level in cluster systems
- Disk
  - Holding virtual memory

# **Questions?**